##1. Kokybės kontrolė : metilinamo tankio funkcijos tarp CpG salų elementų
Kiekvienai pozicinei eilutei paskaičiuojamas tos eilutės mėginių vidurkis.

anno <- readRDS("../output/annotation.rds")
beta <- readRDS("../output/betab.rds")
data <- read.csv("../output/correctedSample.csv")

n<- length(rownames(beta))

#skaičiuoja vidurkius
means <- matrix(ncol=2, nrow=n)
for(i in 1:n){

  means[i,] <- c((rownames(beta)[i]), mean(beta[i,]))
}
colnames(means) <- c("name", "mean")
head(as.data.frame(means))
##         name               mean
## 1 cg01707559  0.171268794480463
## 2 cg02004872  0.117750665305373
## 3 cg02494853 0.0302013463252889
## 4 cg03244189  0.198499881618238
## 5 cg03706273 0.0299952362271422
## 6 cg04023335   0.34280657826654

#Kiekvienai salos grupei skaičiuoajamas CpG tankis.
Šiuo atveju yra 6 salos.

## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
## [1] "Island"  "S_Shore" "OpenSea" "N_Shore" "N_Shelf" "S_Shelf"
## 
## Call:
##  density.default(x = relate)
## 
## Data: Island (27700605 obs.);    Bandwidth 'bw' = 0.004328
## 
##        x                  y           
##  Min.   :-0.01296   Min.   : 0.00001  
##  1st Qu.: 0.24352   1st Qu.: 0.22143  
##  Median : 0.50000   Median : 0.31227  
##  Mean   : 0.50000   Mean   : 0.97362  
##  3rd Qu.: 0.75647   3rd Qu.: 0.62755  
##  Max.   : 1.01295   Max.   :10.62007
## 
## Call:
##  density.default(x = relate)
## 
## Data: S_Shore (8972500 obs.);    Bandwidth 'bw' = 0.01231
## 
##        x                 y           
##  Min.   :-0.0368   Min.   :0.000015  
##  1st Qu.: 0.2316   1st Qu.:0.543589  
##  Median : 0.5000   Median :0.631717  
##  Mean   : 0.5000   Mean   :0.930500  
##  3rd Qu.: 0.7684   3rd Qu.:1.103510  
##  Max.   : 1.0368   Max.   :3.960065
## 
## Call:
##  density.default(x = relate)
## 
## Data: OpenSea (30277470 obs.);   Bandwidth 'bw' = 0.00602
## 
##        x                  y           
##  Min.   :-0.01803   Min.   :0.000001  
##  1st Qu.: 0.24099   1st Qu.:0.305274  
##  Median : 0.50000   Median :0.513307  
##  Mean   : 0.50000   Mean   :0.964245  
##  3rd Qu.: 0.75902   3rd Qu.:1.031196  
##  Max.   : 1.01803   Max.   :4.908815
## 
## Call:
##  density.default(x = relate)
## 
## Data: N_Shore (11468520 obs.);   Bandwidth 'bw' = 0.01167
## 
##        x                  y           
##  Min.   :-0.03477   Min.   :0.000013  
##  1st Qu.: 0.23265   1st Qu.:0.553297  
##  Median : 0.50007   Median :0.641694  
##  Mean   : 0.50007   Mean   :0.933947  
##  3rd Qu.: 0.76749   3rd Qu.:1.115567  
##  Max.   : 1.03491   Max.   :3.712139
## 
## Call:
##  density.default(x = relate)
## 
## Data: N_Shelf (4359155 obs.);    Bandwidth 'bw' = 0.007042
## 
##        x                  y           
##  Min.   :-0.01979   Min.   :0.000001  
##  1st Qu.: 0.24043   1st Qu.:0.230926  
##  Median : 0.50065   Median :0.397615  
##  Mean   : 0.50065   Mean   :0.959798  
##  3rd Qu.: 0.76087   3rd Qu.:1.071314  
##  Max.   : 1.02109   Max.   :5.282310
## 
## Call:
##  density.default(x = relate)
## 
## Data: S_Shelf (3895545 obs.);    Bandwidth 'bw' = 0.006787
## 
##        x                  y           
##  Min.   :-0.01769   Min.   :0.000001  
##  1st Qu.: 0.24179   1st Qu.:0.193128  
##  Median : 0.50127   Median :0.384813  
##  Mean   : 0.50127   Mean   :0.962524  
##  3rd Qu.: 0.76074   3rd Qu.:1.083690  
##  Max.   : 1.02022   Max.   :5.445911

Grafike pavaizduota kiekvienos salos(pagal spalvas legendoje) tankio pasiskirstymai.

X ašis grafike rodo tikimybę gauti x reikšmę.

colors <- c("red","yellow", "green", "gold", "lightblue" ,"blue" ) 
plot(island1, col = colors[1], main = "Relation_to_Islands", xlab = NA)
lines(island2, col = colors[2])
lines(island3, col = colors[3])
lines(island4, col = colors[4])
lines(island5, col = colors[5])
lines(island6, col = colors[6])
legend("top",
  c(islands),
  fill=c(colors)
)

2. Naudojamas hierarchinis klasterizavimas.

Plot (be mėginių vardų) pavaizduoja atstumų dydžius tarp mėginių.
Mėginiai užima labai daug vietos, todėl atvaizdavimas su mėginių pavadinimais nėra prasmingas,
nes medis tiesiog nesimato.

hc<-hclust(as.dist(1-cor(beta)), method = "complete")
plot(hc, xlab="Samples", main="Dissimilarity = 1 - Correlation", labels= FALSE)

## Loading required package: dynamicTreeCut
## Loading required package: fastcluster
## 
## Attaching package: 'fastcluster'
## The following object is masked from 'package:stats':
## 
##     hclust
## 
## 
## Attaching package: 'WGCNA'
## The following object is masked from 'package:S4Vectors':
## 
##     cor
## The following object is masked from 'package:stats':
## 
##     cor

#Dendograma mėginiams pavaizduoti.
Mėginių pavadinimai atitinkamai paversti į spalvas.
Mėginiai x ašyje yra sutampančios grupės. Jos susidarė dėl panašaus atstumo koreliuojant.
Y ašyje matomi panašūs atstumai - jie yra maži. Positions Nurodo pozicijas pagal spalvą( toliau yra legenda, kurioje nurodytos spalvų reikšmės). Taip pat parodyta lytis ir tiriamieji - sveikas asmuo ar sergantis.

head(data)
##    X            sample_name      slide    type source_name_ch1 organism_ch1
## 1 18 DLPFC_Alzheimer_Br5078 GSM3584161 genomic           DLPFC Homo sapiens
## 2 19   DLPFC_Control_Br1987 GSM3584162 genomic           DLPFC Homo sapiens
## 3 20   DLPFC_Control_Br1764 GSM3584163 genomic           DLPFC Homo sapiens
## 4 23 DLPFC_Alzheimer_Br5107 GSM3584166 genomic           DLPFC Homo sapiens
## 5 24 DLPFC_Alzheimer_Br5105 GSM3584167 genomic           DLPFC Homo sapiens
## 6 25   DLPFC_Control_Br1157 GSM3584168 genomic           DLPFC Homo sapiens
##   sample_plate  Array  brNum tissue tissue_region    sample age sex race
## 1      Plate 1 R05C02 Br5078  brain         DLPFC Alzheimer  64   M CAUC
## 2      Plate 1 R01C02 Br1987  brain         DLPFC   Control  54   M CAUC
## 3      Plate 1 R03C01 Br1764  brain         DLPFC   Control  56   M CAUC
## 4      Plate 1 R05C02 Br5107  brain         DLPFC Alzheimer  80   M CAUC
## 5      Plate 1 R01C02 Br5105  brain         DLPFC Alzheimer  76   M CAUC
## 6      Plate 1 R03C01 Br1157  brain         DLPFC   Control  60   M   AA
##   negcontrol_pc1 negcontrol_pc2  neun_pos     snppc1 molecule_ch1 label_ch1
## 1       5.649698     -3.8772291 0.2502178 -0.0553337  genomic DNA   Cy3/Cy5
## 2      -8.078151     -1.1767034 0.2337783 -0.0491986  genomic DNA   Cy3/Cy5
## 3     -14.318467     -1.1313009 0.2161111 -0.0489529  genomic DNA   Cy3/Cy5
## 4       4.431130     -4.2548791 0.2754533 -0.0523536  genomic DNA   Cy3/Cy5
## 5      -7.481998      9.2532100 0.3023755 -0.0556155  genomic DNA   Cy3/Cy5
## 6       5.891752     -0.3727747 0.1653021  0.0908303  genomic DNA   Cy3/Cy5
##   taxid_ch1
## 1      9606
## 2      9606
## 3      9606
## 4      9606
## 5      9606
## 6      9606
##                                                                                                                                                   description
## 1 Postmortem human brain tissue samples from four brain regions in an Alzheimers disease case-control series used to generate Human450k DNA methylation data.
## 2 Postmortem human brain tissue samples from four brain regions in an Alzheimers disease case-control series used to generate Human450k DNA methylation data.
## 3 Postmortem human brain tissue samples from four brain regions in an Alzheimers disease case-control series used to generate Human450k DNA methylation data.
## 4 Postmortem human brain tissue samples from four brain regions in an Alzheimers disease case-control series used to generate Human450k DNA methylation data.
## 5 Postmortem human brain tissue samples from four brain regions in an Alzheimers disease case-control series used to generate Human450k DNA methylation data.
## 6 Postmortem human brain tissue samples from four brain regions in an Alzheimers disease case-control series used to generate Human450k DNA methylation data.
##   description.1 platform_id
## 1         1-C02    GPL13534
## 2         1-C03    GPL13534
## 3         1-C04    GPL13534
## 4         1-C08    GPL13534
## 5         1-C09    GPL13534
## 6         1-C10    GPL13534
##                                                                                                                  Basename
## 1 C:/Users/asta1/Desktop/stud/6sem/biomeds/2darbas/input/GSE125895/supplement/GSE125895_idat/GSM3584161_9283265070_R05C02
## 2 C:/Users/asta1/Desktop/stud/6sem/biomeds/2darbas/input/GSE125895/supplement/GSE125895_idat/GSM3584162_9283265146_R01C02
## 3 C:/Users/asta1/Desktop/stud/6sem/biomeds/2darbas/input/GSE125895/supplement/GSE125895_idat/GSM3584163_9283265163_R03C01
## 4 C:/Users/asta1/Desktop/stud/6sem/biomeds/2darbas/input/GSE125895/supplement/GSE125895_idat/GSM3584166_9297949021_R05C02
## 5 C:/Users/asta1/Desktop/stud/6sem/biomeds/2darbas/input/GSE125895/supplement/GSE125895_idat/GSM3584167_9297949030_R01C02
## 6 C:/Users/asta1/Desktop/stud/6sem/biomeds/2darbas/input/GSE125895/supplement/GSE125895_idat/GSM3584168_9297949038_R03C01
##   sentrix_id                    sample_id
## 1 9283265070 GSM3584161_9283265070_R05C02
## 2 9283265146 GSM3584162_9283265146_R01C02
## 3 9283265163 GSM3584163_9283265163_R03C01
## 4 9297949021 GSM3584166_9297949021_R05C02
## 5 9297949030 GSM3584167_9297949030_R01C02
## 6 9297949038 GSM3584168_9297949038_R03C01
unique(data$source_name_ch1)
## [1] DLPFC ERC   HIPPO
## Levels: DLPFC ERC HIPPO
data$loc_to_num <- data$source_name_ch1
data$loc_to_num <- gsub("DLPFC", 1, data$loc_to_num)
data$loc_to_num <- gsub("ERC", 2, data$loc_to_num)
data$loc_to_num <- gsub("HIPPO", 3, data$loc_to_num)
data$loc_to_num <- as.integer(data$loc_to_num)


plotDendroAndColors(hc, labels2colors(data[,c("source_name_ch1", "sex", "sample")]), 
  main = "Sample dendrogram and trait heatmap",
  dendroLabels=data$loc_to_num,
  addGuide = TRUE,
  groupLabels = c("Position", "Gender", "Sample")
  )

cols <- data.frame(labels2colors(unique(data[,c("source_name_ch1", "sex", "sample")])), unique(data[,c("source_name_ch1", "sex", "sample")]))
col1 <- c(unique(as.character(cols$X1)), unique(as.character(cols$X2)), unique(as.character(cols$X3)))
col2 <- c(unique(as.character(cols$source_name_ch1)), unique(as.character(cols$sex)), unique(as.character(cols$sample)))

plot(NULL,xaxt='n',yaxt='n',bty='n',ylab='',xlab='', xlim=0:1, ylim=0:1) 
legend("topleft", legend = col2,
       fill = col1)

#install.packages("BiocManager") 
#BiocManager::install("WGCNA")

3. Naudojamas heatmap.

Surandami variabiliausi duomenys t.y. surikiuojami ir naudojami pirmieji 5000 pozicijų.
Heatmap pavaizduoja hierarchinį klasterizavimą, reikšmės pakeistos spalvomis. Raudona reiškia mažą reikšmę. Šviesesnė spalva reiškia didesnę.

rowvariances <- apply(beta,1,var)

orderedVarianceIndexes <- order(rowvariances,decreasing = TRUE)
heatmap(beta[orderedVarianceIndexes[1:5000], ], Colv = NA,
        xlab="Location", ylab="Sentrix Id", 
        labRow = data$sentrix_id, 
        labCol = data$source_name_ch1,
        main="heatmap")

4. Principinių komponenčių analizė.

Iš grfiko, galime matyti, kad daugiausia variacijų sudaro pirmos 3 komponentės. Visos kitos komponentės turi mažai variacijų.

components <- prcomp(t(beta))
screeplot(components, main = "Components")

Poruojant komponentes, kai kurios poros turi panašumų pvz (PC4 ir PC5), kai mažėja PC4, mažės ir PC5.

components$x[1:5,1:2]
##                                     PC1        PC2
## GSM3584161_9283265070_R05C02  -4.901724 18.2607528
## GSM3584162_9283265146_R01C02 -10.870001  0.2751873
## GSM3584163_9283265163_R03C01 -12.042775 -5.2408401
## GSM3584166_9297949021_R05C02 -11.140049  3.3627905
## GSM3584167_9297949030_R01C02  -9.025372 13.4253857
pairs(components$x[,1:5], col = labels2colors(data$source_name_ch1), main = "PCA-plot", pch = data$loc_to_num)

5. Pirmoji principinė komponentė turi dideles reikšmes, tik vėliau(žiūrint į apačią) smarkiai mažėja.

Raudona reiškia dideles reikšmes, geltona - mažas. Likusios komponentės yra panašesnės. Ne tokie dideli skirtumai kaip pirmoje komponentėje. Šitaip vaizduojant duomenų matricą, galima greičiau rasti kintamuosius. Šiuo atveju komponenčių ir mėginių pasiskirstymą.

heatmap(components$x[,1:5],  Colv = NA,
        labRow = data$sample)

#heatmap(components$rotation)
## Loading required package: plotly
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:S4Vectors':
## 
##     rename
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Interaktyvus heatmap, kuris nurodo mėginių ir komponenčių reikšmes.

x<-colnames(components$x)
y<- rownames(components$x)

p <- plot_ly(x=x, y=y, 
            z = components$x, 
            type = "heatmap", 
            colorscale= "blues"
            ) %>%
    layout(xaxis = list(title="Components"), yaxis = list(title="Samples"))
p